We revisit a simple Learning-from-Scratch baseline for visuo-motor control that uses data augmentation and a shallow ConvNet. We find that this baseline has competitive performance with recent methods that leverage frozen visual representations trained on large-scale vision datasets.
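The cost efficiency of model inference is critical to real-world machine learning (ML) applications, especially for latency-sensitive tasks and resource-limited devices. A typical dilemma is that, to provide complex intelligent services (e.g., a smart city), we need inference results from multiple ML models, but the cost budget (e.g., GPU memory) is not enough to run all of them. In this work, we study the underlying relationships among black-box ML models and propose a novel learning task, model linking, which aims to bridge the knowledge of different black-box models by learning mappings (dubbed model links) between their output spaces. We propose a design of model links that supports linking heterogeneous black-box ML models. To address the distribution discrepancy challenge, we also present adaptation and aggregation methods for model links. Based on the proposed model links, we develop a scheduling algorithm named MLink. Through collaborative multi-model inference enabled by model links, MLink can improve the accuracy of the inference results obtained under a cost budget. We evaluate MLink on a multi-modal dataset with seven different ML models and on two real-world video analytics systems with 3,264 hours of video. Experimental results show that the proposed model links can be built effectively among various black-box models. Under a GPU memory budget, MLink saves 66.7% of inference computation while preserving 94% of inference accuracy, outperforming multi-task learning, reinforcement-learning-based scheduler, and frame filtering baselines.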
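As a rough illustration of the model-linking idea described above (a learned mapping from one black-box model's output space to another's), the sketch below fits a majority-vote lookup between two classifiers' label outputs. This is a deliberately simplified stand-in, not the link design from the paper; the function names `fit_label_link` and `predict_with_link` and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def fit_label_link(source_outputs, target_outputs):
    """Learn a lookup from source-model labels to target-model labels.

    Both arguments are sequences of labels that the two black-box models
    produced on the same inputs. For each source label we keep the target
    label that co-occurred with it most often.
    """
    counts = defaultdict(Counter)
    for src, tgt in zip(source_outputs, target_outputs):
        counts[src][tgt] += 1
    return {src: c.most_common(1)[0][0] for src, c in counts.items()}


def predict_with_link(link, source_output, default=None):
    """Estimate the target model's output without running the target model."""
    return link.get(source_output, default)


# Toy paired outputs (hypothetical): a scene classifier and a vehicle detector.
scene = ["street", "street", "indoor", "street", "indoor"]
vehicle = ["car", "car", "none", "none", "none"]
link = fit_label_link(scene, vehicle)
```

With such a link, a scheduler could run only the source model on some inputs and use `predict_with_link` to approximate the target model's result, trading accuracy for memory.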
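Mobile-centric AI applications place high demands on the resource efficiency of model inference. Input filtering is a promising approach to eliminating redundancy and thereby reducing inference cost. Previous efforts have tailored effective solutions for many applications, but two essential questions remain unanswered: (1) the theoretical filterability of an inference workload, which would guide the application of input filtering techniques and avoid trial-and-error costs for resource-constrained mobile applications; and (2) the robust discriminability of feature embeddings, which would allow input filtering to be effective across diverse inference tasks and input content. To answer them, we first formalize the input filtering problem and theoretically compare the hypothesis complexity of inference models and input filters to understand the optimization potential. We then propose the first end-to-end learnable input filtering framework, which covers most state-of-the-art methods and provides feature embeddings with robust discriminability. We design and implement InFi, which supports six input modalities and multiple mobile-centric deployments. Comprehensive evaluations confirm our theoretical results and show that InFi outperforms strong baselines in applicability, accuracy, and efficiency. InFi achieves 8.5x throughput and saves 95% of bandwidth while keeping accuracy above 90% for a video analytics application on mobile platforms.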
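The general pattern of input filtering (skip inference when an input is redundant and reuse an earlier result) can be sketched as follows. Note that this uses a hand-written frame-difference heuristic, not the end-to-end learnable filter the abstract describes; `mean_abs_diff`, `filter_inputs`, the threshold, and the toy frames are all assumptions made for illustration.

```python
def mean_abs_diff(a, b):
    """Mean absolute difference between two equally sized flat pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def filter_inputs(frames, infer, threshold=10.0):
    """Run `infer` only on frames that differ enough from the last processed one.

    Returns the per-frame results (cached results are reused for skipped
    frames) and the number of times `infer` was actually called.
    """
    results = []
    last_frame, last_result, calls = None, None, 0
    for frame in frames:
        if last_frame is None or mean_abs_diff(frame, last_frame) > threshold:
            last_result = infer(frame)  # expensive model call
            last_frame = frame
            calls += 1
        results.append(last_result)
    return results, calls


# Toy example: "frames" are two-pixel lists and the "model" just sums pixels.
frames = [[0, 0], [1, 1], [100, 100], [101, 101]]
results, calls = filter_inputs(frames, infer=sum, threshold=10.0)
```

In this toy run, two of the four frames are close enough to an already processed frame that the model is not invoked for them.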
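Spatio-temporal video grounding (STVG) focuses on retrieving the spatio-temporal tube of a specific object described by a free-form textual expression. Existing approaches mainly treat this complex task as a parallel frame-grounding problem and thus suffer from two types of inconsistency: feature alignment inconsistency and prediction inconsistency. In this paper, we present an end-to-end one-stage framework, termed Spatio-Temporal Consistency-Aware Transformer (STCAT), to alleviate these issues. In particular, we introduce a novel multi-modal template as the global objective for this task, which explicitly constrains the grounding region and associates the predictions across all video frames. Moreover, to generate this template with sufficient video-text perception, an encoder architecture is proposed for effective global context modeling. Thanks to these key designs, STCAT enjoys more consistent cross-modal feature alignment and tube prediction without relying on any pre-trained object detectors. Extensive experiments show that our method outperforms the previous state of the art on two challenging video benchmarks (VidSTG and HC-STVG), illustrating the superiority of the proposed framework in better understanding the association between vision and natural language. Code is publicly available at https://github.com/jy0205/stcat.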
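In recent years, a variety of gradient-based methods have been developed to solve Bi-Level Optimization (BLO) problems in machine learning and computer vision. However, the theoretical correctness and practical effectiveness of these existing approaches always rely on restrictive conditions (e.g., Lower-Level Singleton, LLS), which can be hard to satisfy in real-world applications. Moreover, previous literature only proves theoretical results based on specific iteration strategies, and thus lacks a general recipe for uniformly analyzing the convergence behavior of different gradient-based BLO methods. In this work, we formulate BLO from an optimistic bi-level viewpoint and establish a new gradient-based algorithmic framework, named Bi-level Descent Aggregation (BDA), to partially address these issues. Specifically, BDA provides a modularized structure that hierarchically aggregates the upper- and lower-level subproblems to generate our bi-level iterative dynamics. Theoretically, we establish a general convergence analysis template and derive a new proof recipe for investigating the essential theoretical properties of gradient-based BLO methods. Furthermore, this work systematically explores the convergence behavior of BDA in different optimization scenarios, i.e., considering the various solution qualities (global, local, or stationary solutions) returned from solving the approximate subproblems. Extensive experiments verify our theoretical results and demonstrate the superiority of the proposed algorithm on hyper-parameter optimization and meta-learning tasks. Source code is available at https://github.com/vis-opt-group/bda.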
Graph Neural Networks (GNNs) have shown satisfying performance on various graph learning tasks. To achieve better fitting capability, most GNNs have a large number of parameters, which makes them computationally expensive. It is therefore difficult to deploy them onto edge devices with scarce computational resources, e.g., mobile phones and wearable smart devices. Knowledge Distillation (KD) is a common solution for compressing GNNs, where a lightweight model (i.e., the student model) is encouraged to mimic the behavior of a computationally expensive GNN (i.e., the teacher GNN model). Nevertheless, most existing GNN-based KD methods lack fairness consideration. As a consequence, the student model usually inherits and even exaggerates the bias of the teacher GNN. To handle this problem, we take initial steps towards fair knowledge distillation for GNNs. Specifically, we first formulate a novel problem of fair knowledge distillation for GNN-based teacher-student frameworks. We then propose a principled framework named RELIANT to mitigate the bias exhibited by the student model. Notably, the design of RELIANT is decoupled from any specific teacher and student model structures, and thus can be easily adapted to various GNN-based KD frameworks. We perform extensive experiments on multiple real-world datasets, which corroborate that RELIANT achieves less biased GNN knowledge distillation while maintaining high prediction utility.
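For readers unfamiliar with the teacher-student setup mentioned above, here is a minimal sketch of a standard temperature-scaled distillation loss for a single example. It shows only generic knowledge distillation; it does not include RELIANT's fairness mechanism, which the abstract does not specify. The function names and the default `temperature` and `alpha` values are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits at a given temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term.

    Soft term: KL(teacher || student) on temperature-softened distributions,
    scaled by temperature**2 as is conventional.
    Hard term: cross-entropy of the student against the true label.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * (math.log(pt) - math.log(ps))
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce


# A student that matches the teacher exactly incurs no soft-target penalty.
matched = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], label=2, alpha=1.0)
mismatched = distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0], label=2, alpha=1.0)
```

Because the student is trained to match the teacher's output distribution, any bias encoded in those outputs is passed along, which is the inheritance problem the abstract refers to.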
To generate high-quality rendered images for real-time applications, it is common to trace only a few samples per pixel (spp) at a lower resolution and then supersample to the high resolution. Based on the observation that pixels rendered at a low resolution are typically highly aliased, we present a novel method for neural supersampling based on ray tracing 1/4-spp samples at the high resolution. Our key insight is that the ray-traced samples at the target resolution are accurate and reliable, which turns supersampling into an interpolation problem. We present a mask-reinforced neural network to reconstruct and interpolate high-quality image sequences. First, a novel temporal accumulation network is introduced to compute the correlation between current and previous features, significantly improving their temporal stability. Then a reconstruction network based on a multi-scale U-Net with skip connections is adopted to reconstruct and generate the desired high-resolution image. Experimental results and comparisons show that our proposed method generates higher-quality supersampling results than current state-of-the-art methods, without increasing the total number of ray-tracing samples.
Panoptic Part Segmentation (PPS) unifies panoptic segmentation and part segmentation into one task. Previous works use separate approaches to handle thing, stuff, and part predictions, without shared computation or task association. We aim to unify these tasks at the architectural level by designing the first end-to-end unified framework, named Panoptic-PartFormer. Moreover, we find that the previous metric, PartPQ, is biased toward PQ. To handle both issues, we make the following contributions. First, we design a meta-architecture that decouples part features from thing/stuff features. We model things, stuff, and parts as object queries and directly learn to optimize all three forms of prediction as a unified mask prediction and classification problem. Second, we propose a new metric, Part-Whole Quality (PWQ), to better measure this task from both pixel-region and part-whole perspectives. It can also decouple the errors of part segmentation and panoptic segmentation. Third, inspired by Mask2Former and building on our meta-architecture, we propose Panoptic-PartFormer++ and design a new part-whole cross-attention scheme, using masked cross attention, to further boost part segmentation quality. Finally, extensive ablation studies and analysis demonstrate the effectiveness of both Panoptic-PartFormer and Panoptic-PartFormer++. Compared with the previous Panoptic-PartFormer, Panoptic-PartFormer++ achieves improvements of 2% PartPQ and 3% PWQ on the Cityscapes PPS dataset and 5% PartPQ on the Pascal Context PPS dataset. On both datasets, Panoptic-PartFormer++ achieves new state-of-the-art results with a significant cost reduction of 70% in GFLOPs and 50% in parameters. Our models can serve as a strong baseline and aid future research in PPS. Code will be available.
An increasing number of public datasets have shown a marked clinical impact on assessing anatomical structures. However, each of these datasets is small, partially labeled, and rarely covers subjects with severe tumors. Moreover, current models are limited to segmenting specific organs or tumors and cannot be extended to novel domains and classes. To tackle these limitations, we introduce embeddings learned from Contrastive Language-Image Pre-training (CLIP) into segmentation models, yielding what we dub the CLIP-Driven Universal Model. The Universal Model can better segment 25 organs and 6 types of tumors by exploiting the semantic relationship between abdominal structures. The model is developed from an assembly of 14 datasets with 3,410 CT scans and evaluated on 6,162 external CT scans from 3 datasets. We rank first on the public leaderboard of the Medical Segmentation Decathlon (MSD) and achieve state-of-the-art results on Beyond The Cranial Vault (BTCV). Compared with dataset-specific models, the Universal Model is computationally more efficient (6x faster), generalizes better to CT scans from varying sites, and shows stronger transfer learning performance on novel tasks. The CLIP embedding design enables the Universal Model to be easily extended to new classes without catastrophically forgetting previously learned classes.
This paper describes techniques for predicting a user's next intent with a concept knowledge graph. The system has been deployed on the Web at Alipay, serving more than 100 million daily active users. Specifically, we propose AlipayKG to explicitly characterize user intent. AlipayKG is an offline concept knowledge graph in the Life-Service domain that models the historical behaviors of users, the rich content users interact with, and the relations between them. We further introduce a Transformer-based model that integrates expert rules from the knowledge graph to infer the online user's next intent. Experimental results demonstrate that the proposed system can effectively enhance the performance of downstream tasks while retaining explainability.